TeMex: The Web Template Extractor
Authors
Abstract
This paper presents TeMex, a site-level web template extractor. TeMex is fully automatic and works with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed). More importantly, it does not need a predefined set of webpages to perform the analysis: TeMex only needs a URL. Contrary to previous approaches, it includes a mechanism to identify webpage candidates that share the same template. This mechanism increases both recall and precision, and it also reduces the number of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation.
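The abstract states only that TeMex starts from a single URL and identifies candidate webpages likely to share its template; it does not describe the mechanism. As a rough illustration of what candidate selection from a single URL could look like, the sketch below collects same-domain hyperlinks from the starting page. It is an assumption-based example, not TeMex's actual algorithm: the names `LinkCollector`, `candidate_pages`, and `limit` are hypothetical, and the same-domain heuristic is a stand-in for whatever criterion the tool really uses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def candidate_pages(key_url, key_html, limit=5):
    """Return up to `limit` same-domain URLs linked from the key page.

    Hypothetical heuristic: pages on the same domain as the key page
    are assumed to be the most likely to share its template.
    """
    collector = LinkCollector()
    collector.feed(key_html)
    key_domain = urlparse(key_url).netloc
    seen = {key_url}  # never propose the key page itself
    candidates = []
    for href in collector.hrefs:
        # Resolve relative links and drop the fragment part.
        absolute = urljoin(key_url, href).split("#")[0]
        if urlparse(absolute).netloc != key_domain:
            continue  # external link: unlikely to share the template
        if absolute in seen:
            continue  # duplicate or self-link
        seen.add(absolute)
        candidates.append(absolute)
        if len(candidates) == limit:
            break
    return candidates
```

Capping the result with `limit` mirrors the abstract's point that selecting candidates reduces the number of webpages loaded and processed; a real extractor would then fetch those pages and compare their structure with the key page.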
Similar resources
Bridging the Gap: from Multi Document Template Detection to Single Document Content Extraction
Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to determine the main content. In this paper we propose a way to combine the reliability and theoretic underpinning of the first world with the single document based approach of the latter...
Security Efficiency Analysis of a Biometric Fuzzy Extractor for Iris Templates
A biometric fuzzy extractor scheme for iris templates was recently presented in [3]. This fuzzy extractor binds a cryptographic key to the iris template of a user, so that the key can be recovered by authenticating the user with a new iris template. In this work, an analysis of the security efficiency of this fuzzy extractor is carried out by means of a study about t...
WS-NEXT, a Web Services Network Extractor Toolkit
In this article, a Web services network extractor toolkit, WS-NEXT (WS Network EXtractor Toolkit), is presented. WS-NEXT allows extraction of interaction and dependency WS networks. Networks can be extracted from syntactic and semantic WS descriptions. Such network structures can be analyzed using complex network tools. We provide examples of networks extracted from a publicly available WS coll...
A secure authentication scheme based on fuzzy extractor
Biometrics-based authentication schemes are more secure and reliable than traditional authentication schemes, and they are the inevitable trend of future development. However, among the existing schemes, the security of the user's biometric template is usually ignored, which puts the user's information security under great threat. Recently, Yan et al. proposed a secure bio...
Enhancing the Invisible Web
In recent years, a large amount of information has been placed in databases across the globe, and published through dynamically generated Web pages. The evolution of the so-called Invisible (or Hidden) Web constitutes both an opportunity and an issue for Web-based information extractors. This article describes the architecture of an Invisible-Web Extractor, whose primal goal is to enhance the v...
Publication date: 2015